Inferring Sequential Structure
Author: Craig G. Nevill-Manning
Abstract
Structure exists in sequences ranging from human language and music to the genetic information encoded in our DNA. This thesis shows how that structure can be discovered automatically and made explicit. Rather than examining the meaning of the individual symbols in the sequence, structure is detected in the way that certain combinations of symbols recur. In speech and text, these repetitions form words and phrases. They can be concisely represented by a hierarchical context-free grammar, where each repetition gives rise to a rule. As well as exact repetitions, sequences often exhibit branching and looping structure. These can be inferred and visualised as an automaton. We develop simple, robust techniques that reveal interesting structure in a wide range of real-world sequences including music, English text, descriptions of plants, graphical figures, DNA sequences, word classes in language, a genealogical database, programming languages, execution traces, and diverse sequences from a data compression corpus. These techniques are useful: they produce comprehensible visual explanations of structure, identify morphological units and significant phrases, compress data, optimise graphical rendering, infer recursive grammars by recognising self-similarity, and infer programs.
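As a rough illustration of the idea that each repetition gives rise to a grammar rule, the sketch below builds a small hierarchical grammar by repeatedly replacing the most frequent repeated pair of adjacent symbols with a new rule. This is a simplified offline greedy scheme chosen for brevity, not the method developed in the thesis. The function names (`count_pairs`, `infer_grammar`, `expand`) and the `<n>` rule-naming convention are assumptions made for this example, and the input is assumed to be a string of single-character symbols so that rule names cannot collide with terminals.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_end = {}  # pair -> index just past its most recently counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_end.get(pair, 0) <= i:  # skip overlaps such as "aaa"
            counts[pair] += 1
            last_end[pair] = i + 2
    return counts


def infer_grammar(sequence):
    """Return (start_symbols, rules) for a string of single-character symbols.

    Illustrative only: greedy pair replacement, assumed for this sketch.
    """
    seq = list(sequence)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, n = max(counts.items(), key=lambda kv: kv[1])
        if n < 2:  # no pair repeats any more, so no new rule is justified
            break
        name = f"<{len(rules) + 1}>"  # hypothetical naming scheme for rules
        rules[name] = list(pair)
        # Rewrite the sequence left to right, replacing each occurrence.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbols, rules):
    """Expand rule names recursively to recover the original string."""
    parts = []
    for s in symbols:
        parts.append(expand(rules[s], rules) if s in rules else s)
    return "".join(parts)


start, rules = infer_grammar("abcdbcabcd")
print(start, rules)
```

Because later rules can refer to earlier ones, the resulting rule set forms a hierarchy, and `expand(start, rules)` reproduces the input exactly.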
Similar resources
Phrase Hierarchy Inference and Compression in Bounded Space
Text compression by inferring a phrase hierarchy from the input is a recent technique that shows promise both as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such ...
A Public Digital Library Based on Full-Text Retrieval: Collections and Experience
Current size: About ten text collections containing up to a million pages each; one music collection containing ten thousand melodies
Subject matter: Computer science technical reports and bibliographies, humanitarian development information, indigenous peoples issues, public-domain English literature, and magazines
Information sources: Various national and international
Interfaces: WWW, including ...
Inferring Lexical and Grammatical Structure from Sequences
In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained...
Compression and Explanation Using Hierarchical Grammars
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. The algorithm can be stated concisely in the form...
Detecting Sequential Structure
Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequ...